# You can write code and comments in "chunks"
print("Hello World") # start "comments" with the # symbol[1] "Hello World"
2024 NUTRIM microbiome & metabolome workshop
In this workshop we will use an online version of RStudio - via posit.cloud
Sign up for a free posit.cloud account
(use a GitHub or Google account if you already have one of these)
Check your email for a link to join the posit.cloud space for this course
Click on the 2024-NUTRIM-microbiome project - this will start RStudio
Go to Files (look bottom right) > click folder practical-1 > open file practical-1A-notebook.qmd
We will write code and notes in Quarto .qmd documents, e.g.Β practical-1A-notebook.qmd
You can write normal text for notes, e.g.Β your intentions, hypotheses, observations, etc.
# You can write code and comments in "chunks"
print("Hello World") # start "comments" with the # symbol[1] "Hello World"
Insert a code chunk with Ctrl+Alt+I (Windows) or Cmd+Option+I (Mac).
Run the current chunk by clicking the green play button in the corner of the chunk
Run all previous chunks by clicking the other button in the corner of the chunk
# How to run code with keyboard shortcuts?
print("Run this one line with Ctrl + Enter, or Cmd + Enter on Mac")
print("or run a whole chunk with Ctrl/Cmd + SHIFT + Enter")To edit your Quarto notebooks in RStudio you can use βVisualβ editor mode, or the βSourceβ mode.
Toggle between modes at the top left. Visual editor is easier for beginners, but Source mode shows how the Quarto file truly looks (it only contains markdown text, but RStudio adds formatting in Visual mode).
It is a good idea to load all the packages you need at the top of your notebook.
# `readxl` is for reading data from Excel files
library(readxl)# `writexl` is for writing data to a new Excel file
library(writexl) # we'll use this at the end# the `here` package makes specifying file locations easier
library(here)# `tidyverse` is a collection of several packages
library(tidyverse) # (dplyr, ggplot2, and others)How do we read a table of data from a file, e.g.Β an Excel file?
# give your objects short but informative names!
meta <- read_excel(path = here("data/papa2012/papa2012_metadata.xlsx"), na = "NA")
meds <- read_excel(here("data/papa2012/papa2012_metadata.xlsx"), sheet = "treatment")There are multiple ways to tell R where to find the data files for a project.
π The best way π
For each data analysis project you do, create a separate folder for that project, and keep all relevant code and data inside that folder.
RStudio also offers additional convenient features for organisation with RStudio Projects!
Using the R package called here you can easily specify the location of your data using file paths relative to the project folder. The here package offers one important function, also called here()
# this will work anywhere! (anywhere that the project folder is moved or copied)
meta <- read_excel(path = here("data/papa2012/papa2012_metadata.xlsx"), na = "NA")This is reliable and portable! π€© If you share the entire project folder with a collaborator (e.g.Β as a zip file or via github) then this code will work on their computer without needing any changes!
π Easy but limiting π
An absolute path (also known as a full path) can be used to specify where a file is on your own computer.
This will work okay, but only if you never move your files, and never change computerβ¦
Using absolute paths makes it inconvenient to share your project or work collaboratively. π
# this works on my machine, but it won't work anywhere else!
meta <- read_excel(
path = "/Users/david/Documents/teaching/workshops/2024-NUTRIM-microbiome/data/papa2012/papa2012_metadata.xlsx",
na = "NA"
)π€· Commonly used but still problematic π€·
setwd("/an/absolute/path/on/your/computer"), BUT:setwd still uses an absolute path, so it is still not portable!
setwd does not work in notebooks, which reset working directory every chunk
Prof.Β Jenny Bryan might set your computer on fire! π₯οΈπ₯ (read why here)
# this works on my machine, but it won't work anywhere else!
setwd("/Users/david/Documents/teaching/workshops/2024-NUTRIM-microbiome")
meta <- read_excel(path = "data/papa2012/papa2012_metadata.xlsx", na = "NA")Start by printing the metadata data frame, like this:
meta# A tibble: 90 Γ 10
ID sample case_control diagnosis activity active gender ethnicity
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 099A 099-AX Case UC severe active female White
2 199A 199-AX Control Other control control female Other
3 062B 062-BZ Case CD mild active male White
4 194A 194-AZ Case UC mild active male White
5 166A 166-AX Case UC severe active female Black
6 219A 219-AX Case UC inactive inactive female <NA>
7 132A 132-AX Case CD mild active female White
8 026A 026-AX Case UC mild active male White
9 102A 102-AZ Control Other control control male White
10 140A 140-AX Control Other control control female White
# βΉ 80 more rows
# βΉ 2 more variables: family_history <chr>, age_years <dbl>
A dataframe is the standard class of object for holding rectangular data (tables) in your R environment.
Columns of a dataframe can hold vectors of different classes, e.g.Β chr and int (characters and integers)
(In contrast, a matrix can only hold one type of data!)
example_df <- data.frame(alphabet = LETTERS, numbers = 1:26)Try View(example_df) in the Console - but not in a Quarto doc chunk
A tibble is just a dataframe, but with a concise print format, useful!
example_tbl <- as_tibble(example_df)
example_tbl# A tibble: 26 Γ 2
alphabet numbers
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
9 I 9
10 J 10
# βΉ 16 more rows
Each column in a dataframe/tibble can be on of the following types of vector object
c(7.5, -2.301, 0.666).c(9L, 4L, -123L).c(TRUE, FALSE).c("no", "yes", "unsure").factor(c("no", "yes", "unsure")).table(meta$family_history, useNA = "ifany")
fhx nofhx <NA>
25 54 11
table(meta$age_years, useNA = "ifany")
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 24
2 3 5 1 2 5 5 4 6 10 6 9 7 7 4 4 5 3 1 1
You can count the number of times a category occurs, similar to the table function.
meta %>% count(diagnosis)# A tibble: 3 Γ 2
diagnosis n
<chr> <int>
1 CD 23
2 Other 24
3 UC 43
Or count combinations of categories.
meta %>% count(family_history, diagnosis)# A tibble: 9 Γ 3
family_history diagnosis n
<chr> <chr> <int>
1 fhx CD 8
2 fhx Other 4
3 fhx UC 13
4 nofhx CD 12
5 nofhx Other 14
6 nofhx UC 28
7 <NA> CD 3
8 <NA> Other 6
9 <NA> UC 2
Or compute other summary statistics.
meta %>% summarise(age_mean = mean(age_years), age_sd = sd(age_years))# A tibble: 1 Γ 2
age_mean age_sd
<dbl> <dbl>
1 12.6 4.65
Or compute grouped summary statistics
meta %>% summarise(age_mean = mean(age_years), age_sd = sd(age_years), .by = diagnosis)# A tibble: 3 Γ 3
diagnosis age_mean age_sd
<chr> <dbl> <dbl>
1 UC 13.8 4.26
2 Other 9.08 4.30
3 CD 14.1 3.84
Rβs base graphics can be used to quickly summarise data distributions.
ggplot2 is a popular and powerful plotting package.
ggsave to save your plots? π¦ΈππΎ
# Assign your plot to an R object
plot_practice_bars <- ggplot(meta) +
geom_bar(aes(y = activity, fill = diagnosis)) +
ggtitle("Practice bar chart") +
theme_bw()
# create a folder for figures
dir.create(here("practical-1/figs"))
# Write the plot to a file, with ggsave
ggsave(
plot = plot_practice_bars,
filename = here("practical-1/figs/practice-barchart.png"),
width = 6, height = 3, units = "in", dpi = 300
)Be sure to carefully adjust the sizing and resolution of your plots for your paper or presentation!
e.g.
We will use ggplot2 throughout much of this workshop.
It is a powerful and flexible tool, but it takes practice to learn the details.
Links to good resources for learning ggplot2
R 4 Data Science book - intro to plots in R
R Graphics Cookbook - quick practical guide
R graph gallery - ideas and example code
ggplot2 website - a reference manual
ggplot2 book - for a thorough grounding
Cedric Schererβs blog - tutorial/examples
Metaprogramming, Advanced R - experts only π€―
library(patchwork)
plot_list <- list()
# age boxplot
plot_list$box <- meta %>%
ggplot(aes(y = case_control, x = age_years)) +
geom_boxplot(
mapping = aes(fill = case_control),
width = 0.15, staplewidth = 0.5, outliers = FALSE,
position = position_nudge(y = 0.2),
show.legend = FALSE
) +
geom_jitter(height = 0.05, width = 0.1, alpha = 0.8, size = 1) +
labs(y = NULL, x = "Age (years)", title = "Age Distributions") +
theme_bw() +
theme(plot.margin = margin(b = 20, r = 20))
# plot_list$box
# family history piechart
plot_list$pie <- meta %>%
ggplot(aes(y = 1, fill = family_history)) +
geom_bar(position = "stack", colour = "black") +
scale_fill_manual(
values = c(fhx = "grey30", nofhx = "grey70"),
na.value = "white", guide = "none"
) +
annotate("text", x = I(pi / 8), y = I(0.25), label = "NA") +
annotate("text", x = I(0.9 * pi), y = I(0.25), label = "No FHx") +
annotate("text", x = I(1.7 * pi), y = I(0.25), label = "FHx", colour = "white") +
labs(tag = "Family History") +
labs(x = NULL, y = NULL) +
coord_radial(expand = FALSE, inner.radius = 0.3) +
theme_void() +
theme(plot.tag.location = "panel", plot.margin = margin(t = 10))
# plot_list$pie
# activity barchart
plot_list$bar <- meta %>%
ggplot(aes(
x = diagnosis,
fill = factor(activity, c("severe", "moderate", "mild", "inactive"))
)) +
geom_bar(colour = "black", linewidth = 0.2) +
scale_x_discrete(limits = c("UC", "CD")) +
scale_fill_brewer(name = "Activity", palette = "Reds", direction = -1) +
labs(title = "Activity Level", x = NULL, y = NULL) +
coord_cartesian(ylim = c(0, NA), expand = FALSE) +
theme_classic() +
theme(
plot.margin = margin(l = 20),
legend.key.height = unit(1, "cm"), legend.key.width = unit(3, "mm")
)
# plot_list$bar
# assemble with patchwork
wrap_plots(
A = plot_list$box, C = plot_list$bar,
B = plot_list$pie, guides = "collect",
design =
"AAC
AAC
BBC
BBC
BBC"
)Often, your data are not all in one table. For example, there were two sheets of data in the metadata Excel file, which we stored in the dataframe objects meta and meds
The main dataframe meta contains most of the data about each patient:
meta# A tibble: 90 Γ 10
ID sample case_control diagnosis activity active gender ethnicity
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 099A 099-AX Case UC severe active female White
2 199A 199-AX Control Other control control female Other
3 062B 062-BZ Case CD mild active male White
4 194A 194-AZ Case UC mild active male White
5 166A 166-AX Case UC severe active female Black
6 219A 219-AX Case UC inactive inactive female <NA>
7 132A 132-AX Case CD mild active female White
8 026A 026-AX Case UC mild active male White
9 102A 102-AZ Control Other control control male White
10 140A 140-AX Control Other control control female White
# βΉ 80 more rows
# βΉ 2 more variables: family_history <chr>, age_years <dbl>
The second dataframe meds contains medications info for the IBD cases:
meds# A tibble: 67 Γ 4
ID diagnosis immunosuppression_level medication
<chr> <chr> <chr> <chr>
1 048A UC level0 mesalamine only
2 119A UC level0 mesalamine only
3 120D UC level0 mesalamine only
4 192A UC level0 mesalamine only
5 215A UC level0 mesalamine only
6 216A UC level0 mesalamine only
7 233A CD level0 mesalamine only
8 009A UC level1 steroids
9 035A UC level1 steroids
10 038A UC level1 steroids
# βΉ 57 more rows
dplyr provides functions to βjoinβ dataframes together, using shared variables.
all_meta <- left_join(meta, meds)Joining with `by = join_by(ID, diagnosis)`
By default the join function will perform a βnaturalβ join using all shared variables. For greater control you can specify a βkeyβ variable, or set of variables, that should be used.
# this should do the same thing as the natural join shown above
all_meta <- left_join(meta, meds, by = join_by(ID, diagnosis))What happened with the Control group? (remember they were not present in meds!)
Inspect the all_meta dataframe to find out! e.g.Β View(all_meta)
A natural left join is the most common join you will need.
But there are many other useful possibilities, e.g.
Learn more at:
Often, you need to modify your variables, or create new ones.
For simple transformations you can easily do this with base R.
# create a logical variable, TRUE if patient has family history of IBD
all_meta$ibd_fhx <- all_meta$family_history == "fhx"
# always check the result is what you expected!
all_meta[, c("ID", "family_history", "ibd_fhx")]# A tibble: 90 Γ 3
ID family_history ibd_fhx
<chr> <chr> <lgl>
1 099A nofhx FALSE
2 199A <NA> NA
3 062B nofhx FALSE
4 194A nofhx FALSE
5 166A <NA> NA
6 219A nofhx FALSE
7 132A nofhx FALSE
8 026A nofhx FALSE
9 102A fhx TRUE
10 140A nofhx FALSE
# βΉ 80 more rows
The mutate function from dplyr is great for making multiple or complex transformations. You refer to variables without repeating the name of the dataframe. It is as if you are working βinsideβ the dataframe.
# this is equivalent to the previous block
all_meta <- all_meta %>% mutate(ibd_fhx = family_history == "fhx")
# check the result again
all_meta %>% select(ID, family_history, ibd_fhx)# A tibble: 90 Γ 3
ID family_history ibd_fhx
<chr> <chr> <lgl>
1 099A nofhx FALSE
2 199A <NA> NA
3 062B nofhx FALSE
4 194A nofhx FALSE
5 166A <NA> NA
6 219A nofhx FALSE
7 132A nofhx FALSE
8 026A nofhx FALSE
9 102A fhx TRUE
10 140A nofhx FALSE
# βΉ 80 more rows
We can convert character string variables to factor variables to specify an ordering of their levels (e.g.Β for plotting).
You can temporarily mutate a dataframe and use the result. This is easy with pipes.
all_meta %>%
mutate(diagnosis = factor(diagnosis, levels = c("CD", "UC", "Other"))) %>%
ggplot(aes(diagnosis, age_years)) +
geom_boxplot()Notice this did NOT persistently modify the diagnosis variable, it is still βcharacterβ class.
class(all_meta$diagnosis) # no persistent changes, diagnosis is still character![1] "character"
We need to clean up the medication history.
table(all_meta$medication, useNA = "if")
abx abx, imsp imsp mesalamine only
1 3 14 7
steroids steroids, abx steroids, abx, imsp steroids, imsp
12 4 11 11
<NA>
27
table(all_meta$immunosuppression_level, useNA = "if")
level0 level1 level2 level3 level4 none <NA>
7 16 18 11 10 4 24
We have NAs for the medication for all the Controls. We know they had βnoneβ, so letβs first indicate that. We will also replace Controlsβ immunosuppression_level NA values with βnoneβ.
all_meta <- all_meta %>% mutate(
medication = if_else(case_control == "Control", true = "none", false = medication),
immunosuppression_level = if_else(case_control == "Control", true = "none", false = immunosuppression_level)
)
# check the result!
all_meta %>% select(ID, case_control, medication, immunosuppression_level)# A tibble: 90 Γ 4
ID case_control medication immunosuppression_level
<chr> <chr> <chr> <chr>
1 099A Case steroids, imsp level3
2 199A Control none none
3 062B Case steroids, imsp level3
4 194A Case steroids level1
5 166A Case abx none
6 219A Case abx, imsp level3
7 132A Case steroids, imsp level2
8 026A Case steroids, abx, imsp level4
9 102A Control none none
10 140A Control none none
# βΉ 80 more rows
We now want logical variables indicating if the patient recently had antibiotics (abx), steroids, or other immunosuppressive drugs (imsp).
all_meta <- all_meta %>% mutate(
abx = str_detect(medication, "abx"),
steroids = str_detect(medication, "steroids"),
imsp_other = str_detect(medication, "imsp"),
imsp_any = steroids | imsp_other
)
# check the result!
# all_meta %>% select(case_control, medication, abx, steroids, imsp_other, imsp_any)For the character variables with more than two values, we can convert them to factors, to encode our preferred ordering of their levels.
In R stats methods, the 1st level of a factor is often used as the reference level by default. So we should keep this in mind, and typically set control levels as the 1st factor level. We can always temporarily change this for plots.
all_meta <- all_meta %>% mutate(
diagnosis = factor(diagnosis, c("Other", "CD", "UC")),
active = factor(active, c("control", "inactive", "active")),
activity = factor(activity, c("control", "inactive", "mild", "moderate", "severe"))
)
# you do the same for immunosuppression_level! don't forget to check the resultLetβs save the all_meta dataframe in three formats: Excel, CSV, and RDS, all within a folder called βprocessedβ inside the βdataβ folder.
First, weβll create the βprocessedβ directory.
output_path <- here("data", "papa2012", "processed")
dir.create(output_path)Warning in dir.create(output_path):
'/Users/david/Documents/git/R-projects/teaching/workshops/2024-NUTRIM-microbiome/data/papa2012/processed'
already exists
Then we write an excel file, using the write_xlsx function from the writexl package.
write_xlsx(all_meta, here(output_path, "all_metadata.xlsx"))CSV files are a common format for storing tabular data. We use the write_csv function from the readr package.
write_csv(all_meta, here(output_path, "all_metadata.csv"))RDS files are Rβs native format for saving single R objects. They preserve data types and structures, making them ideal for saving and loading a variety of R objects.
saveRDS(all_meta, here(output_path, "all_metadata.rds"))We have recapped some R fundamentals and introduced the sample metadata.
Next we will start working with microbiome data!
Click here: https://david-barnett.github.io/2024-NUTRIM-microbiome/practical-1/web/practical1B-instructions.html
For reproducibility, it is useful to record the packages and versions used in your analyses. This is easy to do with sessioninfo::session_info().
sessioninfo::session_info()β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
setting value
version R version 4.4.0 (2024-04-24)
os macOS Sonoma 14.5
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Amsterdam
date 2024-06-29
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
package * version date (UTC) lib source
bit 4.0.5 2022-11-15 [1] CRAN (R 4.4.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.4.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.4.0)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.4.0)
codetools 0.2-20 2024-03-31 [1] CRAN (R 4.4.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.4.0)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.4.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.4.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
knitr 1.47 2024-05-29 [1] CRAN (R 4.4.0)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
patchwork * 1.2.0 2024-01-08 [1] RSPM (R 4.4.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
ragg 1.3.2 2024-05-15 [1] CRAN (R 4.4.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.0)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
readxl * 1.4.3 2023-07-06 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
systemfonts 1.1.0 2024-05-15 [1] CRAN (R 4.4.0)
textshaping 0.4.0 2024-05-24 [1] CRAN (R 4.4.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
vroom 1.6.5 2023-12-05 [1] CRAN (R 4.4.0)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
writexl * 1.5.0 2024-02-09 [1] CRAN (R 4.4.0)
xfun 0.44 2024-05-15 [1] CRAN (R 4.4.0)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
[1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ